video clips
- Health & Medicine > Therapeutic Area > Neurology (1.00)
- Health & Medicine > Health Care Technology (1.00)
- Health & Medicine > Diagnostic Medicine (0.93)
- Information Technology (0.67)
ViLCo-Bench: VIdeo Language COntinual learning Benchmark
Tang, Tianqi
- For what purpose was the dataset created? To address this, we propose ViLCo-Bench.
- Who created the dataset (e.g., which team, research group) and on behalf of which entity?
- Who funded the creation of the dataset?
- What do the instances that comprise the dataset represent (e.g., documents, photos)?
- What data does each instance consist of?
- Is there a label or target associated with each instance?
- Oceania > Australia > New South Wales (0.05)
- Europe > United Kingdom > Wales (0.04)
- Government (0.95)
- Information Technology (0.69)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Asia > China > Hong Kong (0.04)
Kinetics-400 [1] is a large-scale action recognition dataset with trimmed video clips of around 10 seconds each. It is collected from realistic YouTube videos and covers 400 categories of human activities. In total, it contains around 240K training videos and 20K validation videos. Specifically, when training Kinetics-200/-400 from scratch, we adopt a cosine learning-rate decay schedule with an initial learning rate of 0.1. The initial learning rate is 0.005 and decays by a factor of 0.1 at epochs 20 and 40.
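Purely as a hedged sketch of this recipe (not code from the Kinetics papers; the SGD optimizer, momentum, and total epoch counts are assumptions), the two learning-rate settings above could be set up in PyTorch as follows:

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, MultiStepLR

# Stand-in model; the actual Kinetics backbone is not specified in this excerpt.
model = torch.nn.Linear(512, 400)

# From-scratch setting: cosine decay from an initial learning rate of 0.1.
optimizer_scratch = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
cosine_sched = CosineAnnealingLR(optimizer_scratch, T_max=100)  # total epochs assumed

# Step setting: start at 0.005 and multiply by 0.1 at epochs 20 and 40.
optimizer_step = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
step_sched = MultiStepLR(optimizer_step, milestones=[20, 40], gamma=0.1)

for epoch in range(50):
    # ... one training epoch over Kinetics clips would run here ...
    step_sched.step()
```

MultiStepLR reproduces the "decay by 0.1 at epochs 20 and 40" rule, while CosineAnnealingLR covers the from-scratch cosine schedule.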
EEG2Video: Towards Decoding Dynamic Visual Perception from EEG Signals
Our visual experience in daily life is dominated by dynamic change. Decoding such dynamic information from brain activity can enhance our understanding of the brain's visual processing system. However, previous studies have predominantly focused on reconstructing static visual stimuli. In this paper, we explore decoding dynamic visual perception from electroencephalography (EEG), a neuroimaging technique able to record brain activity at high temporal resolution (1000 Hz) and thus capture rapid changes in the brain. Our contributions are threefold: first, we develop a large dataset recording signals from 20 subjects while they watched 1400 dynamic video clips covering 40 concepts.
From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings
Zhang, Jiajie, Schwertfeger, Sören, Kleiner, Alexander
We present a novel unsupervised framework to unlock vast unlabeled human demonstration data from continuous industrial video streams for Vision-Language-Action (VLA) model pre-training. Our method first trains a lightweight motion tokenizer to encode motion dynamics, then employs an unsupervised action segmenter leveraging a novel "Latent Action Energy" metric to discover and segment semantically coherent action primitives. The pipeline outputs both segmented video clips and their corresponding latent action sequences, providing structured data directly suitable for VLA pre-training. Evaluations on public benchmarks and a proprietary electric motor assembly dataset demonstrate effective segmentation of key tasks performed by humans at workstations. Further clustering and quantitative assessment via a Vision-Language Model confirm the semantic coherence of the discovered action primitives. To our knowledge, this is the first fully automated end-to-end system for extracting and organizing VLA pre-training data from unstructured industrial videos, offering a scalable solution for embodied AI integration in manufacturing.
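The "Latent Action Energy" metric itself is not defined in this excerpt. Purely as an illustration of energy-based segmentation over latent action tokens, the sketch below uses the norm of consecutive latent differences as a stand-in energy signal and cuts a new primitive wherever that signal drops below a threshold; the energy proxy and all names are assumptions, not the authors' method.

```python
import numpy as np

def segment_by_energy(latents: np.ndarray, threshold: float) -> list[tuple[int, int]]:
    """Split a latent action sequence into primitives at low-energy frames.

    `latents` has shape (T, D): one latent action token per frame.
    The energy proxy (norm of consecutive latent differences) is an
    assumption for illustration, not the paper's Latent Action Energy.
    """
    # Frame-to-frame motion energy; near-zero values suggest a pause between primitives.
    energy = np.linalg.norm(np.diff(latents, axis=0), axis=1)
    boundaries = [0]
    for t, e in enumerate(energy, start=1):
        if e < threshold and t - boundaries[-1] > 1:
            boundaries.append(t)
    boundaries.append(len(latents))
    # Return (start, end) index pairs, one per discovered primitive segment.
    return list(zip(boundaries[:-1], boundaries[1:]))

# Toy usage: random-walk latents, threshold set from a low percentile of the energy signal.
rng = np.random.default_rng(0)
demo = np.cumsum(rng.normal(scale=0.5, size=(200, 16)), axis=0)
energies = np.linalg.norm(np.diff(demo, axis=0), axis=1)
print(segment_by_energy(demo, threshold=float(np.percentile(energies, 10))))
```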
Sekai: A Video Dataset towards World Exploration
Li, Zhen, Li, Chuanhao, Mao, Xiaofeng, Lin, Shaoheng, Li, Ming, Zhao, Shitian, Xu, Zhaopan, Li, Xinyue, Feng, Yukang, Sun, Jianwen, Li, Zizhen, Zhang, Fanrui, Ai, Jiaxin, Wang, Zhixiang, Wu, Yuwei, He, Tong, Pang, Jiangmiao, Qiao, Yu, Jia, Yunde, Zhang, Kaipeng
Video generation techniques have made remarkable progress, promising to become the foundation of interactive world exploration. However, existing video generation datasets are not well suited for world exploration training, as they suffer from several limitations: limited locations, short duration, static scenes, and a lack of annotations about exploration and the world. In this paper, we introduce Sekai (meaning "world" in Japanese), a high-quality first-person-view worldwide video dataset with rich annotations for world exploration. It consists of over 5,000 hours of walking or drone-view (FPV and UAV) videos from over 100 countries and regions across 750 cities. We develop an efficient and effective toolbox to collect, pre-process, and annotate videos with location, scene, weather, crowd density, captions, and camera trajectories. Comprehensive analyses and experiments demonstrate the dataset's scale, diversity, annotation quality, and effectiveness for training video generation models. We believe Sekai will benefit the areas of video generation and world exploration and motivate valuable applications. The project page is https://lixsp11.github.io/sekai-project/.
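As a hypothetical sketch of what one per-clip annotation record might look like given the fields listed in the abstract (the field names and types are assumptions, not the released Sekai schema):

```python
from dataclasses import dataclass


@dataclass
class SekaiClipAnnotation:
    # Hypothetical layout mirroring the annotation types listed in the abstract;
    # field names and types are assumptions, not the released schema.
    location: str                                         # e.g. "Tokyo, Japan"
    scene: str                                            # e.g. "street", "park"
    weather: str                                          # e.g. "sunny", "rain"
    crowd_density: str                                    # e.g. "sparse", "dense"
    caption: str                                          # free-text description of the clip
    camera_trajectory: list[tuple[float, float, float]]   # assumed per-frame camera positions
```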
- Asia > China > Shanghai > Shanghai (0.04)
- Asia > China > Guangdong Province > Shenzhen (0.04)
- North America > United States (0.04)
- (4 more...)